URL as starting point for WWW document categorization
Abstract
Information about the category (type) of a WWW page can help the user in search, filtering, and navigation tasks. We propose a multidimensional categorisation scheme with the bibliographic dimension as the primary one. We examine the possibilities and limits of performing such categorisation based on information extracted from the URL, which is particularly useful for on-line applications such as meta-search or navigation support. In addition, we describe the problem of ambiguity of URL terms and suggest a method for partially overcoming it by means of machine learning. As a side effect, we show that general-purpose WWW search engines can be used to provide input data for both human and computational analysis of the web.
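To make the idea of URL-based categorisation concrete, the following is a minimal sketch and not the method of the paper. It splits a URL into terms and looks them up in a small term-to-category lexicon. The lexicon `TERM_CATEGORIES`, its category names, and the helper functions `url_terms` and `candidate_categories` are all hypothetical, introduced here only for illustration. The entry for `pub` deliberately maps to two categories to show the kind of term ambiguity the abstract mentions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical lexicon (an assumption, not taken from the paper):
# maps a URL term to the page categories it may indicate.
TERM_CATEGORIES = {
    "papers": {"publication list"},
    "publications": {"publication list"},
    "pub": {"publication list", "public files"},  # ambiguous term
    "staff": {"personal page"},
    "people": {"personal page"},
    "faq": {"FAQ"},
}


def url_terms(url):
    """Split the host, path and query of a URL into lowercase alphabetic terms."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [term.lower() for term in re.findall(r"[A-Za-z]+", text)]


def candidate_categories(url):
    """Return every category suggested by at least one term of the URL."""
    found = set()
    for term in url_terms(url):
        found |= TERM_CATEGORIES.get(term, set())
    return found
```

A URL whose terms hit only unambiguous entries yields a single candidate category, while a URL containing `pub` yields two. Some additional step, such as the machine learning mentioned in the abstract, would be needed to choose between them.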
Similar resources
Weight Adjustment Schemes for a Centroid Based Classifier
In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on t...
Weight Adjustment Schemes for a Centroid Based Classifier
In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on t...
Web-Specific Genre Visualization
User interfaces to WWW search engines typically present results as ranked lists of documents. Such lists give users little help in understanding document variation: we propose a richer representation of retrieval results in the search interface. Fundamental to us is the notion of document grouping. We use both stylistic genre-based document categorization and statistical content-based clusterin...
A Domain Cluster Interface for WWW Search
Because of the recent explosive increase in the number of WWW documents, directory services are indispensable in finding needed documents. In the keyword search function of most directory services, search results are displayed as a URL list ordered by importance calculated by the system, but the order sometimes does not have any meaning to the user since the calculation algorithm is a black box...
Centroid-Based Document Classification: Analysis & Experimental Results
In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes of documents, is an important task that can help both in organizing as well as in finding information on these huge resources....